PDB matching
The PDB, or Protein Data Bank, contains over 150,000 three-dimensional structures of proteins and related molecules. Looking at the PDB can be useful for Foldit players.

This article describes how to interpret some of the information found in the PDB, especially with regard to how things are numbered. It's a companion to the recipe PDBReader 0.9, which partially interprets a PDB entry and optionally applies it to a Foldit protein.

Although Foldit's rules forbid direct copying from the PDB or similar sources, the PDB can act as a source of inspiration for design puzzles. When "just browsing" the PDB, the way the residues are numbered is not generally a major concern.

Foldit revisiting puzzles generally involve solved proteins that can be found in the PDB. Direct copying is still out of bounds, but players are free to look at the solutions.

Each of the puzzles in the revisiting puzzle master list has a guide page which shows the primary structure of the protein. In addition to the primary structure, the guide page also shows a list of PDB entries which match the puzzle protein.

Sometimes, residue 1 of the PDB protein corresponds to segment 1 of the Foldit protein. In other cases, the two proteins match, but there may be additional segments in the PDB protein, so the numbering won't be the same. The examples below show how to interpret these offsets between Foldit and the PDB.

Matching to the PDB - top level

The primary structure is really the only way for players to match a protein to the PDB. The primary structure is just the sequence of amino acids that make up the protein. The most common way to show primary structure is a sequence of single-letter amino acid codes, also known as FASTA format.

There are tools which can match proteins based on their tertiary structure, or 3D shape, but Foldit doesn't give players access to this level of detail.

The first step in PDB matching is to get the primary structure. Several Foldit recipes can help with that step.
The primary structure string can then be used to search online databases.

Recipes

In Foldit, the recipes print protein 2.8, AA Edit 2.0, and AA Copy Paste Compare v 1.1.1 -- Brow42 can each retrieve the primary structure as a FASTA-style string. (True FASTA format has a header, which these recipes don't create, and limits line length to 80 characters, just in case punch cards come back into style.)

With the sequence in hand, or better yet, in the clipboard, several online search tools are available. One possibility is to go directly to the PDB; another is to use a different tool, such as Jpred. Many other tools provide similar capabilities as well.

PDB - direct

One method for matching is to go directly to the PDB, at rcsb.org, and paste the primary structure string into the search box.

For example, given the primary structure of the TCR Binding Protein revisiting puzzle:

dvmweykwentgdaelygpftsaqmqtwvsegyfpdgvycrkldppggqfynskridfdlyt

a simple search of the PDB returns one hit, to 1L2Z.

For a better search of the PDB, select Advanced Search, then under the "Choose a Query Type" dropdown, select Sequence Features. Paste the primary structure into the sequence field, and search again.

This search returns a longer list of proteins, including these matches:

*4BWS
*1SYX
*1GYF
*1L2Z

These matches cover 100% of the Foldit protein. Some of these entries have more than one chain matching the Foldit protein. Also, the matching chain in each entry may be longer than the Foldit protein.

The PDB advanced search also returns two very partial matches:

*3FMA
*3K3V

These matches don't cover 100% of the Foldit protein. Only small sections of the proteins match the Foldit protein. However, the protein descriptions mention a key "GYF domain", and it just so happens that one of the matching sections is "GYF" (or "gyf"), meaning a sequence of glycine-tyrosine-phenylalanine.
Both 3FMA and 3K3V also have a "GPTF" ("p" is for proline) section near the GYF, also matching the Foldit protein. So the small sections of 3FMA and 3K3V might still have something to say about the shape of the Foldit protein, despite the weak match.

Jpred

The Jpred tool, at http://www.compbio.dundee.ac.uk/jpred/index.html, can do a couple of related tasks. It works with the same primary structure string used by the PDB and other sites.

For a totally unknown protein, such as the ones seen in (most) de-novo puzzles in Foldit, Jpred acts as a secondary structure prediction tool. Jpred displays the likely locations of helixes and sheets, and also predicts sharp turns or bends in the protein.

Before attempting a secondary structure prediction, Jpred also searches for known proteins that match the specified primary structure, in whole or in part.

Given the primary structure of the TCR Binding Protein, Jpred returns the same list of good matches as an advanced search of the PDB, except that Jpred treats each matching chain separately, so you may see two or three matches for the same PDB id. Jpred also does not show the weaker matches that the PDB search did.

Under "Alignment of PDB hits to your sequence", the "Click to show/hide details" button shows exactly how each "hit" matches the specified sequence. For this particular protein, all the matches report:

Identities = 62/62 (100%), Positives = 62/62 (100%)

which means that the PDB sequence information matches the Foldit protein exactly.

"Identities" are cases where the PDB entry has the same amino acid as the specified sequence. "Positives" are cases where the PDB entry has an amino acid that's closely related to the one in the specified sequence. For example, glutamate ("E" or "e") and glutamine ("Q" or "q") have a similar structure, differing in only one atom at the end of their sidechains.
Glutamate and glutamine can be considered a positive when they appear at the same relative spot in a primary structure sequence. Both Jpred and the PDB show a line indicating the common features of two sequences. A plus ("+") sign is used to indicate positives.

Especially for Foldit revisiting puzzles, 100% identity, a complete and exact match, is the starting point.

Matching to the PDB - deep dive

The PDB contains lots of complex information. As with any complex system that has grown over time, various inconsistencies and redundancies have arisen, and are now somewhat "baked in" (or perhaps "highly conserved"). As a result, even a seemingly simple exact match can start to look a little problematic.

offsets

In some of the examples for the TCR Binding Protein shown earlier, segment 1 of the Foldit protein isn't residue 1 of the matching PDB protein.

For example, chains C and F of the 4BWS protein match the Foldit protein, but segment 1 of the Foldit "query" is residue 10 of the PDB "subject". This type of offset is common, and most often just means that the PDB protein is longer than the one in Foldit. In the case of 4BWS, just subtracting 9 from the PDB residue number reported in the match gives you the Foldit segment number.

This case appears simple, until you consider that 4BWS contains multiple chains. The PDB numbers residues just like Foldit numbers segments. So chains C and F won't begin with a low-numbered residue like 1 or 10.

DBREF

The classic PDB file, with its punchcard format, contains records like SHEET and ATOM that deliver data fields in specific columns.

The DBREF record is found in the header section of a PDB file. There should be at least one DBREF record for each chain in the protein. Among other things, the DBREF record indicates the starting sequence number of the chain, in the field "seqBegin", columns 15-18.

The DBREF record for chain C of 4BWS:

DBREF 4BWS C 280 341 UNP O95400 CD2B2_HUMAN 280 341

indicates that the chain begins with residue 280.
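The DBREF arithmetic can be sketched in a few lines of Python. This is only an illustration, not part of PDBReader: the function names `parse_dbref` and `pdb_to_foldit` are made up here, the record is split on whitespace (which works for the record as quoted in this article, while a strict parser would slice the fixed columns instead), and it assumes that Foldit segment 1 is the residue numbered seqBegin, as is the case for chain C of 4BWS.

```python
def parse_dbref(record):
    """Return the PDB id, chain id, and first/last residue numbers of a DBREF record.

    Assumption: fields are separated by whitespace, as in the record quoted
    in this article. A strict parser would read fixed columns instead
    (seqBegin is columns 15-18).
    """
    fields = record.split()
    if len(fields) < 5 or fields[0] != "DBREF":
        raise ValueError("not a DBREF record: %r" % record)
    return {
        "pdb_id": fields[1],
        "chain": fields[2],
        "seq_begin": int(fields[3]),
        "seq_end": int(fields[4]),
    }


def pdb_to_foldit(residue_number, seq_begin):
    """Convert a PDB residue number to a Foldit segment number.

    Assumption: Foldit segment 1 is the residue numbered seq_begin.
    """
    return residue_number - seq_begin + 1


dbref = parse_dbref("DBREF 4BWS C 280 341 UNP O95400 CD2B2_HUMAN 280 341")
offset = 1 - dbref["seq_begin"]  # value to add to a PDB residue number
first = pdb_to_foldit(dbref["seq_begin"], dbref["seq_begin"])
last = pdb_to_foldit(dbref["seq_end"], dbref["seq_begin"])
print(offset, first, last)  # prints: -279 1 62
```

The last segment comes out as 62, which agrees with the 62/62 identities reported for this protein.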
Other records, such as SHEET, HELIX, and SSBOND (containing disulfide bridge info), will use these numbers, so they'll need to be adjusted accordingly.

This is another offset, similar to the one seen in the initial PDB matches. In this case, it's a large offset, -279. The nice thing about this offset is that it eliminates the need for the -9 offset from the initial match. The extra 9 residues have seemingly vanished.

sequence information

The PDB contains primary sequence information in various places. For convenience, the PDB has a FASTA format version of the sequence available. For chain C of 4BWS, the FASTA file looks like this:

>4BWS:C|PDBID|CHAIN|SEQUENCE
MAHHHHHHMDVMWEYKWENTGDAELYGPFTSAQMQTWVSEGYFPDGVYCRKLDPPGGQFYNSKRIDFDLYT

The initial MAHHHHHHM is what's known as a "His-tag" or more properly a Polyhistidine-tag, an "expression tag" or artificial identifier added to help identify a protein grown in the lab from artificial DNA.

The PDB file for 4BWS contains SEQADV (sequence advice) records that explain what happened to the numbering:

SEQADV 4BWS MET C 271 UNP O95400 EXPRESSION TAG
SEQADV 4BWS ALA C 272 UNP O95400 EXPRESSION TAG
SEQADV 4BWS HIS C 273 UNP O95400 EXPRESSION TAG
SEQADV 4BWS HIS C 274 UNP O95400 EXPRESSION TAG
SEQADV 4BWS HIS C 275 UNP O95400 EXPRESSION TAG
SEQADV 4BWS HIS C 276 UNP O95400 EXPRESSION TAG
SEQADV 4BWS HIS C 277 UNP O95400 EXPRESSION TAG
SEQADV 4BWS HIS C 278 UNP O95400 EXPRESSION TAG
SEQADV 4BWS MET C 279 UNP O95400 EXPRESSION TAG

The tag is simply numbered as residues 271-279, with the actual protein starting at residue 280, corresponding to segment 1 as seen in Foldit.

The FASTA file is not really the primary source of sequence information. The PDB file contains SEQRES (sequence result) records that specify the sequence. As with other PDB records, SEQRES uses three-character amino acid codes, so "MET" instead of "M" and "ALA" instead of "A", but also "TYR" instead of "Y" and "TRP" instead of "W".
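The mapping from three-character codes back to single letters is easy to sketch in Python. The table below covers only the 20 standard amino acids; the function name `three_to_one` and the choice to turn any unrecognized code into "X" are assumptions made for this illustration.

```python
# Three-character to single-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def three_to_one(codes):
    """Translate a list of three-character codes into a single-letter string.

    Assumption: any code not in the table (modified residues, ligands)
    is shown as "X".
    """
    return "".join(THREE_TO_ONE.get(code.upper(), "X") for code in codes)


# The residues named in the SEQADV records for chain C of 4BWS.
tag = ["MET", "ALA", "HIS", "HIS", "HIS", "HIS", "HIS", "HIS", "MET"]
print(three_to_one(tag))  # prints: MAHHHHHHM
```

The result is the same MAHHHHHHM that starts the FASTA version of chain C.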
For chain C of 4BWS, the SEQRES records look like this:

SEQRES 1 C 71 MET ALA HIS HIS HIS HIS HIS HIS MET ASP VAL MET TRP
SEQRES 2 C 71 GLU TYR LYS TRP GLU ASN THR GLY ASP ALA GLU LEU TYR
SEQRES 3 C 71 GLY PRO PHE THR SER ALA GLN MET GLN THR TRP VAL SER
SEQRES 4 C 71 GLU GLY TYR PHE PRO ASP GLY VAL TYR CYS ARG LYS LEU
SEQRES 5 C 71 ASP PRO PRO GLY GLY GLN PHE TYR ASN SER LYS ARG ILE
SEQRES 6 C 71 ASP PHE ASP LEU TYR THR

The expression tag is there - "MET ALA HIS HIS HIS HIS HIS HIS MET", just as if it were part of the chain. But the DBREF record indicates that these AAs are not really part of the chain.

Just for the sake of redundancy, the sequence information appears in yet another spot in the PDB file. The ATOM records in the body of the PDB file give the 3D position of each atom. They include an atom number, but also a residue number and an amino acid (residue) code. The ATOM records for the aspartate that is the first non-expression-tag residue of 4BWS, chain C, look like this:

ATOM 1300 N ASP C 280 32.107 13.062 36.375 1.00 43.68 N
ATOM 1301 CA ASP C 280 32.495 14.201 35.527 1.00 40.04 C
ATOM 1302 C ASP C 280 33.540 15.140 36.156 1.00 31.78 C
ATOM 1303 O ASP C 280 33.740 15.150 37.374 1.00 27.53 O
ATOM 1304 CB ASP C 280 31.265 14.989 35.071 1.00 41.70 C
ATOM 1305 CG ASP C 280 30.086 14.824 36.013 1.00 49.08 C

omissions and origins

In the previous example, we saw that numbering based on sequence information, as seen in the FASTA file and the SEQRES records, doesn't necessarily correspond to the numbering in the rest of the PDB file.

For chain C of 4BWS, the expression tag is present, but the tag actually falls into a gap in the DBREF records. Chain B ends at residue 265, and chain C begins at residue 280. The residues in between the two chains may or may not be present in the ATOM records. (In fact, chain B ends with the ATOM records for residue 259, and C starts with the ATOM records for residue 276.)

This kind of flexibility in numbering can go further.
Chain A of 4BWS starts with another expression tag. The SEQADV records start numbering for this tag at -4:

SEQADV 4BWS MET A -4 UNP P83876 EXPRESSION TAG
SEQADV 4BWS ALA A -3 UNP P83876 EXPRESSION TAG
SEQADV 4BWS HIS A -2 UNP P83876 EXPRESSION TAG
SEQADV 4BWS HIS A -1 UNP P83876 EXPRESSION TAG
SEQADV 4BWS HIS A 0 UNP P83876 EXPRESSION TAG
SEQADV 4BWS HIS A 1 UNP P83876 EXPRESSION TAG
SEQADV 4BWS HIS A 2 UNP P83876 EXPRESSION TAG
SEQADV 4BWS HIS A 3 UNP P83876 EXPRESSION TAG

The first ATOM record is actually for residue 5 in 4BWS, but nothing in the PDB prevents the use of zero or negative numbers.

As the PDB-101 tutorial Primary Sequences and the PDB Format states:

In many cases, you may find that the coordinates presented in ATOM records in a PDB file may not exactly match the sequence in the SEQRES records. The ends of chains and mobile loops are often not observed in crystallographic experiments, and coordinates are not included as ATOM records in the file. (...)

(...) The numbering of residues can also provide an additional complication. In some cases, the researchers number the ATOM records based on the numbering of the whole protein, while in other cases, they number the chain based on the fragment. Any number (negative, 0, positive) can be used.

Category:Biochemistry
Category:Glossary